Machine Learning Time Series Forecasting (LSTM) on OMRON connect Data

May 2021 ~ OMRON Healthcare Europe

Length:   0.5 mo (at 1.0 FTE)

Programming languages:    Python (pandas, time, datetime, math, Matplotlib, NumPy, scikit-learn, TensorFlow)

Data:  Over 4 million blood pressure measurements registered via OMRON connect by approximately 35 000 users, each containing the recorded systolic value, the device used, and the date and time of the measurement

Problem description:
Build a multivariate, multistep, single-output LSTM that predicts the next two weekly systolic averages of active OMRON connect users.

Approach:
Building on the pre-processing done in the previous project (see the first paragraph of Approach and Results in Big Data Analysis with PySpark on OMRON connect data), the predictor features that would not be known at prediction time, such as diastolic pressure and pulse, were removed. The device-related variables were kept, on the assumption that users keep measuring with the same device: only about 1% of the users changed their devices while using the app. The measurements of each user were then aggregated per week and indexed relative to that user's start date, so that a single model could be trained across multiple users. Afterward, the categorical features were one-hot encoded, and the continuous ones were standardized or normalized.
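The weekly aggregation and encoding steps can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the column names (`user_id`, `timestamp`, `systolic`, `device`) are hypothetical, and the choice of `StandardScaler` for the continuous feature is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy measurements; the column names are hypothetical, not the real schema.
raw = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2021-01-04", "2021-01-06", "2021-01-12", "2021-01-20",
        "2021-02-01", "2021-02-03", "2021-02-09",
    ]),
    "systolic": [130, 134, 128, 126, 118, 122, 121],
    "device": ["A", "A", "A", "A", "B", "B", "B"],
})

# Week index counted from each user's own first measurement,
# so that week 0 means "first week of use" for every user.
first_seen = raw.groupby("user_id")["timestamp"].transform("min")
raw["week"] = (raw["timestamp"] - first_seen).dt.days // 7

# One row per user and week: mean systolic plus the device used.
weekly = (
    raw.groupby(["user_id", "week"])
       .agg(systolic_mean=("systolic", "mean"), device=("device", "first"))
       .reset_index()
)

# One-hot encode the categorical device column.
weekly = pd.get_dummies(weekly, columns=["device"], dtype=float)

# Standardize the continuous feature. In a real pipeline the scaler
# would be fitted on the training portion only to avoid leakage.
scaler = StandardScaler()
weekly["systolic_scaled"] = scaler.fit_transform(weekly[["systolic_mean"]]).ravel()

print(weekly)
```

Keeping the fitted `scaler` object around matters later, because the predictions have to be mapped back to mmHg before they are evaluated.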

Prior to modeling, the data was reshaped into tensors. A sliding-window function was then written to create the lagged inputs and the validation dataset. Subsequently, the sequential model was built with four hidden layers: two pairs of an LSTM layer with 100 units followed by a dropout layer to prevent overfitting. The simplified architecture of the neural network is shown below.
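A sliding-window function of the kind described could look like the sketch below. The function name `make_windows`, the convention that the target sits in column 0, and the window sizes are illustrative assumptions; the original function (which also produced the validation set) is not shown in the source.

```python
import numpy as np

def make_windows(series, n_lags, horizon):
    """Turn one user's weekly series into supervised samples.

    series : array of shape (T, F), target assumed to be in column 0
    returns X of shape (N, n_lags, F) and y of shape (N, horizon)
    """
    series = np.asarray(series, dtype=float)
    n_features = series.shape[1]
    X, y = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        end = start + n_lags
        X.append(series[start:end])               # past n_lags weeks, all features
        y.append(series[end:end + horizon, 0])    # next `horizon` target values
    return (np.array(X).reshape(-1, n_lags, n_features),
            np.array(y).reshape(-1, horizon))

# Toy series: 10 weeks, 2 features (target = week number, plus a constant).
weeks = np.column_stack([np.arange(10.0), np.ones(10)])
X, y = make_windows(weeks, n_lags=4, horizon=2)
print(X.shape, y.shape)
```

The resulting `X` already has the (samples, time steps, features) layout that recurrent layers expect, so no further reshaping is needed before training.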

Figure: LSTM neural network architecture
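In Keras, an architecture with two LSTM(100) + dropout pairs might be written roughly as below. Only the layer types, their order, and the 100 units come from the description; the dropout rate, the dense output layer with one unit per forecast week, the Adam optimizer, the learning rate, and the input dimensions are assumptions made for this sketch.

```python
import tensorflow as tf

def build_model(n_lags, n_features, horizon=2, dropout_rate=0.2):
    """Two LSTM(100) + Dropout pairs, then a dense multistep output.

    dropout_rate, the Dense head, and the optimizer settings are
    illustrative assumptions, not values reported for the project.
    """
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_lags, n_features)),
        # First pair: returns the full sequence so the next LSTM can consume it.
        tf.keras.layers.LSTM(100, return_sequences=True),
        tf.keras.layers.Dropout(dropout_rate),
        # Second pair: returns only the final hidden state.
        tf.keras.layers.LSTM(100),
        tf.keras.layers.Dropout(dropout_rate),
        # One output unit per forecast step (two weekly averages).
        tf.keras.layers.Dense(horizon),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="mse",
    )
    return model

model = build_model(n_lags=4, n_features=3)
model.summary()
```

Setting `return_sequences=True` on the first LSTM is what allows the layers to be stacked: without it, the second LSTM would receive a single vector instead of a sequence.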

Finally, the main hyperparameters of the model (learning rate, optimizer, number of batches, epochs, and steps per epoch) were tuned by grid search with time-series cross-validation.
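The idea of grid search with time-series cross-validation can be illustrated with scikit-learn's `TimeSeriesSplit`, which always validates on data that comes after the training fold. To keep the sketch fast and self-contained, a ridge regression stands in for the LSTM; the helper names `grid_search_ts` and `ridge_fit_predict`, the grid values, and the synthetic data are all hypothetical. Rows are assumed to be in chronological order.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

def grid_search_ts(X, y, param_grid, fit_predict, n_splits=3):
    """Return (best_params, best_rmse), averaging RMSE over forward-chaining folds."""
    cv = TimeSeriesSplit(n_splits=n_splits)
    best_params, best_rmse = None, np.inf
    for params in ParameterGrid(param_grid):
        fold_rmse = []
        for train_idx, val_idx in cv.split(X):
            # Train on the past, validate on the block that follows it.
            pred = fit_predict(X[train_idx], y[train_idx], X[val_idx], **params)
            fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
        score = float(np.mean(fold_rmse))
        if score < best_rmse:
            best_params, best_rmse = params, score
    return best_params, best_rmse

# Stand-in learner (NOT the LSTM): any function with this shape can be plugged in.
def ridge_fit_predict(X_train, y_train, X_val, alpha):
    return Ridge(alpha=alpha).fit(X_train, y_train).predict(X_val)

# Synthetic data, purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

best_params, best_rmse = grid_search_ts(
    X, y, {"alpha": [0.01, 1.0, 100.0]}, ridge_fit_predict
)
print(best_params, round(best_rmse, 3))
```

For the neural network, `fit_predict` would build, train, and apply the model for one hyperparameter combination, with the grid covering learning rate, optimizer, and the batch and epoch settings.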

Results:
Before the predictions could be compared against the withheld test set, they were scaled back to their original magnitude. A baseline was then implemented that naively estimates each user's next two weekly averages as the last one registered. Its RMSE on the test data was 7.31, while the LSTM model scored an RMSE of 4.97. The LSTM network therefore reduced the RMSE by about 32% relative to the naive baseline (equivalently, the baseline's RMSE was about 47% higher than the LSTM's).
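The evaluation logic can be sketched as follows: a last-value baseline, an RMSE helper, the inverse of standardization, and the arithmetic relating the two reported scores. The helper names and the toy values are illustrative assumptions; only the RMSE figures 7.31 and 4.97 come from the results above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all users and forecast steps."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def naive_forecast(last_weekly_avg, horizon=2):
    """Repeat each user's last observed weekly average `horizon` times."""
    last = np.asarray(last_weekly_avg, dtype=float)
    return np.repeat(last[:, None], horizon, axis=1)

def inverse_standardize(z, mean, std):
    """Map standardized predictions back to the original scale (mmHg)."""
    return np.asarray(z, dtype=float) * std + mean

# Toy example with two users (values are made up).
last_seen = np.array([130.0, 120.0])
y_true = np.array([[133.0, 126.0], [120.0, 124.0]])
baseline_rmse = rmse(y_true, naive_forecast(last_seen))

# Relating the reported scores: baseline 7.31 vs LSTM 4.97.
reduction = (7.31 - 4.97) / 7.31   # share of baseline error removed
ratio = 7.31 / 4.97                # baseline error relative to LSTM error
print(round(baseline_rmse, 3), round(reduction, 3), round(ratio, 3))
```

The two ways of expressing the gap differ only in the denominator: dividing the difference by the baseline score gives the error reduction, while dividing by the LSTM score gives how much larger the baseline error is.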
